
Conversation

**lbliii** (Contributor) commented Jan 2, 2026

initial pass at creating SDG docs

Note: I'll be out next week, but feel free to leave any comments and I'll get to them ASAP.

lbliii added 3 commits January 2, 2026 10:25
Signed-off-by: Lawrence Lane <[email protected]>
**Copilot AI** left a comment

Pull request overview

This PR adds comprehensive documentation for the new Ray-based Synthetic Data Generation (SDG) capabilities in NeMo Curator. The documentation covers both simple multilingual Q&A generation and advanced NemotronCC pipelines for text transformation and knowledge extraction.

Key Changes

  • Added tutorial README with quick start examples and command-line reference for all SDG scripts
  • Created comprehensive documentation structure covering LLM client configuration, multilingual Q&A tutorials, and NemotronCC pipeline workflows
  • Updated release notes to reflect SDG feature availability and removed the previous limitation note about SDG being under refactoring

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `tutorials/synthetic/README.md` | Enhanced tutorial README with detailed usage examples, a command-line arguments table, and links to documentation |
| `docs/index.md` | Added Synthetic Data section to the main documentation table of contents |
| `docs/curate-text/synthetic/index.md` | Created overview page explaining SDG architecture, use cases, and available stages with a mermaid diagram |
| `docs/curate-text/synthetic/llm-client.md` | Added comprehensive LLM client configuration guide covering NVIDIA API, vLLM, and TGI endpoints with performance tuning |
| `docs/curate-text/synthetic/multilingual-qa.md` | Created step-by-step tutorial for generating multilingual Q&A pairs with code examples and a CLI reference |
| `docs/curate-text/synthetic/nemotron-cc/index.md` | Documented NemotronCC pipeline architecture with composable pattern explanation and task configuration |
| `docs/curate-text/synthetic/nemotron-cc/tasks.md` | Created detailed reference for all five NemotronCC tasks with prompt templates and post-processing logic |
| `docs/curate-text/index.md` | Added Synthetic Data Generation card to the text curation index page |
| `docs/about/release-notes/index.md` | Added SDG feature announcement and removed the previous limitation note |

- NVIDIA API
- Base URL for the API endpoint
* - `--model-name`
- llama-3.3-70b
**Copilot AI** commented on Jan 2, 2026:

The default value "llama-3.3-70b" doesn't match the actual default used in the example script (synthetic_data_generation_example.py), which is "meta/llama-3.3-70b-instruct". Update this to match the actual implementation for consistency.

Suggested change:
`- llama-3.3-70b` → `- meta/llama-3.3-70b-instruct`

Comment on lines +78 to +79
## Command-Line Arguments

**Copilot AI** commented on Jan 2, 2026:

The section header "Command-Line Arguments" discusses arguments across different scripts, but the title suggests these are universal. Consider adding clarifying text that differentiates between common arguments (used by multiple scripts) and script-specific arguments, or rename to "Command-Line Reference" for better clarity.

Suggested change:
`## Command-Line Arguments` → `## Command-Line Reference`, followed by the note: "The arguments below are grouped into options shared across multiple example scripts and options specific to particular NemotronCC pipelines. Not every argument applies to every tutorial; refer to each script's `--help` output for the complete, authoritative list."

**greptile-apps** bot commented Jan 2, 2026

Greptile Summary

This PR adds comprehensive documentation for synthetic data generation (SDG) capabilities in NeMo Curator. The documentation includes a well-structured overview, LLM client configuration guide, multilingual Q&A tutorial, and detailed NemotronCC pipeline documentation with clear architecture diagrams and task references.

Major additions:

  • SDG overview page with architecture diagram, use cases, and stage comparison
  • LLM client configuration guide with performance tuning and troubleshooting
  • Multilingual Q&A generation tutorial with step-by-step examples
  • NemotronCC documentation covering 5 specialized tasks (WikiPara, DiverseQA, Distill, ExtractKnowledge, KnowledgeList)
  • Enhanced tutorial README with comprehensive CLI examples
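The LLM client behaviors listed above (limiting concurrent requests, retrying with exponential backoff) can be sketched in plain asyncio. The class and function names below are hypothetical illustrations of the pattern, not NeMo Curator's actual client API:

```python
import asyncio
import random


class ThrottledClient:
    """Toy sketch: cap concurrency and retry transient failures with backoff."""

    def __init__(self, max_concurrent_requests: int = 4, max_retries: int = 3):
        # Semaphore enforces the concurrency cap across all in-flight calls.
        self._sem = asyncio.Semaphore(max_concurrent_requests)
        self._max_retries = max_retries

    async def generate(self, prompt: str, call_llm) -> str:
        async with self._sem:
            for attempt in range(self._max_retries + 1):
                try:
                    return await call_llm(prompt)
                except ConnectionError:
                    if attempt == self._max_retries:
                        raise
                    # Exponential backoff with a little jitter (kept tiny here).
                    await asyncio.sleep(0.01 * (2 ** attempt) + random.random() * 0.01)


async def main() -> list[str]:
    failures = {"count": 0}

    async def flaky_llm(prompt: str) -> str:
        # Fail once, then succeed, to exercise the retry path.
        if failures["count"] < 1:
            failures["count"] += 1
            raise ConnectionError("transient")
        return f"answer to: {prompt}"

    client = ThrottledClient(max_concurrent_requests=2)
    return await asyncio.gather(*(client.generate(p, flaky_llm) for p in ["q1", "q2"]))
```

A real client would catch provider-specific rate-limit errors rather than `ConnectionError`, but the control flow is the same.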

Critical issue:

  • Release notes (docs/about/release-notes/index.md) were drastically reduced, removing comprehensive v26.02 information about Docker, PyPI, video/audio modalities, and architecture changes. The SDG content should be added as a new section rather than replacing existing release information.

Confidence Score: 3/5

  • This PR is generally safe but requires fixing the release notes before merging to avoid losing critical release information
  • The documentation is well-written and comprehensive, but the release notes issue is a critical problem that removes important information users need. The SDG docs themselves are high quality with no logical issues.
  • Pay close attention to docs/about/release-notes/index.md which needs the original v26.02 release content restored

Important Files Changed

| Filename | Overview |
| --- | --- |
| `docs/about/release-notes/index.md` | Release notes drastically reduced from 231 lines to 44 lines, removing comprehensive v26.02 release information about Docker, PyPI, video/audio modalities, deduplication, and architecture changes |
| `docs/curate-text/synthetic/index.md` | New comprehensive overview page for synthetic data generation with clear architecture diagrams, use cases, and a stage comparison table |
| `docs/curate-text/synthetic/llm-client.md` | Well-structured LLM client configuration guide with comprehensive parameter documentation, examples, and troubleshooting tips |
| `docs/curate-text/synthetic/multilingual-qa.md` | Step-by-step tutorial for multilingual Q&A generation with clear code examples, CLI usage, and sample output |
| `docs/curate-text/synthetic/nemotron-cc/index.md` | Comprehensive NemotronCC pipeline documentation with clear architecture diagrams, a task comparison table, and quality-based processing strategy |
| `docs/curate-text/synthetic/nemotron-cc/tasks.md` | Detailed task reference with prompt templates, configuration examples, and post-processing details for each NemotronCC stage |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Pipeline
    participant LLMClient
    participant NVAPI as NVIDIA API/vLLM
    participant Output

    User->>Pipeline: Create SDG Pipeline
    User->>Pipeline: Add QAMultilingualSyntheticStage or NemotronCC Stage
    User->>Pipeline: Configure AsyncOpenAIClient

    User->>Pipeline: pipeline.run()
    Pipeline->>LLMClient: Initialize client with rate limiting

    alt Multilingual Q&A
        Pipeline->>LLMClient: Generate Q&A pairs in languages
        LLMClient->>NVAPI: Async API calls (max_concurrent_requests)
        NVAPI-->>LLMClient: Generated Q&A responses
        LLMClient->>Pipeline: Return DocumentBatch
        Pipeline->>Pipeline: Apply language filters (optional)
    else NemotronCC Pipeline
        Pipeline->>Pipeline: Preprocessing (tokenize, segment, filter)
        Pipeline->>LLMClient: Transform documents via LLM
        LLMClient->>NVAPI: Batch API calls with retry logic
        NVAPI-->>LLMClient: Transformed text (paraphrased/QA/distilled)
        LLMClient->>Pipeline: Return transformed data
        Pipeline->>Pipeline: Postprocessing (cleanup, quality filter)
    end

    Pipeline->>Output: Write to JSONL/Parquet
    Output-->>User: Generated synthetic data
```
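The diagram's flow reduces to a small stage-pipeline sketch. Everything here, including `DocumentBatch`, the stage functions, and the `Pipeline` class, is a toy illustration echoing the diagram's vocabulary, not NeMo Curator's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentBatch:
    """Toy stand-in for the batch type passed between stages."""
    documents: list = field(default_factory=list)


class Pipeline:
    """Toy pipeline: stages run in order, each mapping a batch to a new batch."""

    def __init__(self):
        self._stages = []

    def add_stage(self, stage):
        self._stages.append(stage)
        return self

    def run(self, batch: DocumentBatch) -> DocumentBatch:
        for stage in self._stages:
            batch = stage(batch)
        return batch


def qa_stage(batch: DocumentBatch) -> DocumentBatch:
    # Stand-in for an LLM-backed Q&A generation stage.
    docs = [{**d, "qa": f"Q/A for: {d['text']}"} for d in batch.documents]
    return DocumentBatch(docs)


def language_filter(batch: DocumentBatch) -> DocumentBatch:
    # Stand-in for the optional language-filter step in the diagram.
    return DocumentBatch([d for d in batch.documents if d.get("lang") == "en"])
```

The real stages call out to an LLM client and carry resource specifications, but the compose-then-run shape is the same.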

**greptile-apps** bot left a comment

Additional Comments (1)

  1. docs/about/release-notes/index.md, line 21 (link)

    style: The "What's Next" section placeholder needs completion before release

11 files reviewed, 1 comment


**huvunvidia** (Contributor) left a comment

I did a very quick pass, mainly to identify features/content that were generated by the AI agent without being manually tested.

Signed-off-by: Lawrence Lane <[email protected]>
**greptile-apps** bot commented Jan 2, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

**greptile-apps** bot left a comment

1 file reviewed, 1 comment


2. **NeMo Curator with text extras**

```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
```

The installation command is inconsistent with the standard format used elsewhere in the documentation. The --extra-index-url https://pypi.nvidia.com flag is not needed with uv pip install, and the package name should be quoted.

The command should match the format used in other quickstart guides (e.g., docs/get-started/text.md line 52):

Suggested change:
`uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]` → `uv pip install "nemo-curator[text_cuda12]"`

This ensures consistency with the rest of the documentation and follows the recommended installation pattern.


**greptile-apps** bot left a comment

1 file reviewed, 1 comment



- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)
- [NemotronCC Pipeline Documentation](../../docs/curate-text/synthetic/nemotron-cc/index.md)
- [Task Reference](../../docs/curate-text/synthetic/nemotron-cc/tasks.md)

Missing newline at end of file. Add a trailing newline for POSIX compliance.


Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
**greptile-apps** bot left a comment

2 files reviewed, 2 comments


Comment on lines 13 to +35
# NeMo Curator Release Notes: {{ current_release }}

This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
## Synthetic Data Generation

**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions.
New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:

## Installation Updates
- **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff
- **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts
- **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows:
- **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose
- **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training
- **Distill**: Create condensed, information-dense paraphrases preserving key concepts
- **Extract Knowledge**: Extract factual content as textbook-style passages
- **Knowledge List**: Extract structured fact lists from documents

- **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`)
- **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support
- **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management
- **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality:
Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).

```{list-table} Available Installation Extras
:header-rows: 1
:widths: 25 35 40
* - Extra
- Installation Command
- Description
* - **All Modalities**
- `nemo-curator[all]`
- Complete installation with all modalities and GPU support
* - **Text Curation**
- `nemo-curator[text_cuda12]`
- GPU-accelerated text processing with RAPIDS
* - **Image Curation**
- `nemo-curator[image_cuda12]`
- Image processing with NVIDIA DALI
* - **Audio Curation**
- `nemo-curator[audio_cuda12]`
- Speech recognition with NeMo ASR models
* - **Video Curation**
- `nemo-curator[video_cuda12]`
- Video processing with GPU acceleration
* - **Basic GPU**
- `nemo-curator[cuda12]`
- CUDA utilities without modality-specific dependencies
```

All GPU installations require the NVIDIA PyPI index:
```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[EXTRA]
```

## New Modalities

### Video

NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities:

- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction
- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal
- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement
- **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings
- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions
- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md)

### Audio

New [audio curation capabilities](../../curate-audio/index.md) for speech data processing:

- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models
- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation
- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second)
- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage`
- **Manifest support**: JSONL manifest format for audio file management

## Modality Refactors

### Text

- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
- **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization
- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification

### Image

- **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores

Learn more about [image curation](../../curate-images/index.md).

## Deduplication Improvements

Enhanced deduplication capabilities across all modalities with improved performance and flexibility:

- **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
- **New ranking strategies**: Added `RankingStrategy` which allows you to rank elements within cluster centers to decide which point to prioritize during duplicate removal, supporting [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs

## Core Refactors

The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:

```{mermaid}
graph LR
subgraph "User Layer"
P[Pipeline]
S1[ProcessingStage X→Y]
S2[ProcessingStage Y→Z]
S3[ProcessingStage Z→W]
R[Resources<br/>CPU/GPU/NVDEC/NVENC]
end
subgraph "Orchestration Layer"
BE[BaseExecutor Interface]
end
subgraph "Backend Layer"
XE[XennaExecutor<br/>Production Ready]
RAP[RayActorPoolExecutor<br/>Experimental]
RDE[RayDataExecutor<br/>Experimental]
end
subgraph "Adaptation Layer"
XA[Xenna Adapter]
RAPA[Ray Actor Adapter]
RDA[Ray Data Adapter]
end
subgraph "Execution Layer"
X[Cosmos-Xenna<br/>Streaming/Batch]
RAY1[Ray Actor Pool<br/>Load Balancing]
RAY2[Ray Data API<br/>Dataset Processing]
end
P --> S1
P --> S2
P --> S3
S1 -.-> R
S2 -.-> R
S3 -.-> R
P --> BE
BE --> XE
BE --> RAP
BE --> RDE
XE --> XA
RAP --> RAPA
RDE --> RDA
XA --> X
RAPA --> RAY1
RDA --> RAY2
style XE fill:#90EE90
style RAP fill:#FFE4B5
style RDE fill:#FFE4B5
style P fill:#E6F3FF
style BE fill:#F0F8FF
```

### Pipelines

- **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface
- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
- **Resource specification**: Configurable CPU and GPU memory requirements per stage
- **Stage composition**: Improved stage validation and execution orchestration

### Stages

- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
- **Resource requirements**: Built-in resource specification for CPU and GPU memory
- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
- **Input/output validation**: Enhanced type checking and data validation

## Tutorials

- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API
- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend
- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)

For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.

## Known Limitations

> (Pending Refactor in Future Release)
### Generation

- **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility
- **Hard negative mining**: Retrieval-based data generation workflows under development

### PII

- **PII processing**: Personal Identifiable Information removal tools are being updated for Ray backend
- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development

### Blending & Shuffling

- **Data blending**: Multi-source dataset blending functionality being refactored
- **Dataset shuffling**: Large-scale data shuffling operations under development

## Docs Refactor

- **Local preview capability**: Improved documentation build system with local preview support
- **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md))
- **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples

---

## What's Next

The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.
The next release will focus on ...

The release notes have been drastically reduced from 231 lines to just 44 lines, removing all the comprehensive release information about the v26.02 release. The original content included detailed information about:

  • Installation updates (Docker, UV, PyPI improvements)
  • New modalities (Video and Audio support)
  • Modality refactors (Text and Image improvements)
  • Deduplication improvements
  • Core architecture refactors (Pipeline and Stage redesigns)
  • Tutorials updates
  • Known limitations

This content should not be completely replaced. Instead, the SDG documentation should be added to the existing release notes as a new section, not used as a replacement for all the other release information.

The release notes serve as a critical reference for users understanding what changed in the release. Removing this information will confuse users and make it difficult for them to understand the full scope of the v26.02 release.

| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` |
---

## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)

Missing newline after the "Additional Resources" heading. The markdown formatting is incorrect - there should be a blank line between the heading and the list items.

Suggested change:
`## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` →
`## Additional Resources` followed by `- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` on its own line.

**greptile-apps** bot left a comment

1 file reviewed, 1 comment


| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` |
---

## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)

Missing newline between heading and bullet list. Add a blank line after "## Additional Resources" for proper markdown rendering.

Suggested change:
`## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` →
`## Additional Resources` followed by `- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` on its own line.

**greptile-apps** bot left a comment

Additional Comments (1)

  1. tutorials/synthetic/README.md, line 103 (link)

    style: Missing trailing newline at end of file.


12 files reviewed, 3 comments


Comment on lines +15 to +28
## Synthetic Data Generation

**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions.
New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:

## Installation Updates
- **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff
- **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts
- **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows:
- **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose
- **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training
- **Distill**: Create condensed, information-dense paraphrases preserving key concepts
- **Extract Knowledge**: Extract factual content as textbook-style passages
- **Knowledge List**: Extract structured fact lists from documents

- **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`)
- **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support
- **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management
- **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality:
Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).

logic: The release notes have been reduced from 231 lines to 44 lines, removing all comprehensive v26.02 release information including Docker updates, PyPI improvements, video/audio modalities, deduplication improvements, and architecture refactors. The SDG documentation should be added to existing release notes, not replace them entirely. Users need the full scope of v26.02 changes for understanding what's new in the release.

| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` |
---

## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)

style: Missing blank line between heading and list.

Suggested change:
`## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` →
`## Additional Resources` followed by `- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` on its own line.


@ayushdg ayushdg requested a review from satishra-ai January 13, 2026 21:00
**satishra-ai** commented:
The doc doesn't mention how to generate the input data required by the CLI below:

Process Parquet input files:

python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \
    --task diverse_qa \
    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
    --input-parquet-path ./my_data/*.parquet \
    --output-path ./synthetic_output \
    --output-format parquet
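On that point, the expected input schema isn't stated in this thread, so the sketch below assumes one record per document with a `text` field, and writes JSONL (a format the pipelines also write, per the sequence diagram); the helper name is hypothetical, and producing Parquet instead would typically go through pandas/pyarrow (e.g. `pd.DataFrame(records).to_parquet(...)`):

```python
import json
from pathlib import Path


def write_jsonl(records: list, path: str) -> int:
    """Hypothetical helper: write one JSON object per line and return the count.

    Assumes each record is a dict with at least a "text" field; the actual
    schema required by the example pipeline should be confirmed in the docs.
    """
    out = Path(path)
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

Documenting a small preparation step like this in the tutorial README would address the reviewer's gap.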
